Report of the Odyssey FPGA Independent Assessment Team

Abstract: An independent assessment team (IAT) was formed and met on April 2,
2001, at Lockheed Martin in Denver, Colorado, to aid in understanding a technical issue
for the Mars Odyssey spacecraft scheduled for launch on April 7, 2001. An RP1280A
field-programmable gate array (FPGA) from a lot of parts common to the SIRTF,
Odyssey, and Genesis missions had failed on a SIRTF printed circuit board. A second
FPGA from an earlier Odyssey circuit board was also known to have failed and was also 
included in the analysis by the IAT. Observations indicated an abnormally high failure 
rate for flight RP1280A devices (the first flight lot produced using this flow) at Lockheed 
Martin and the causes of these failures were not determined. Standard failure analysis 
techniques were applied to these parts; however, additional diagnostic techniques unique 
for devices of this class were not used, and the parts were prematurely submitted to a 
destructive physical analysis, making a determination of the root cause of failure difficult. 

Any of several potential failure scenarios may have caused these failures, including 
electrostatic discharge, electrical overstress, manufacturing defects, board design errors, 
board manufacturing errors, FPGA design errors, or programmer errors. Several of these 
mechanisms would have relatively benign consequences for disposition of the parts 
currently installed on boards in the Odyssey spacecraft if established as the root cause of 
failure. However, other potential failure mechanisms could have more dire 
consequences. As there is no simple way to determine the likely failure mechanisms with 
reasonable confidence before Odyssey launch, it is not possible for the IAT to 
recommend a disposition for the other parts on boards in the Odyssey spacecraft based on 
sound engineering principles. 

Introduction

JPL management requested outside expertise to aid in understanding a technical issue for the
Mars Odyssey spacecraft scheduled for launch on April 7, 2001. An Independent Assessment Team
was formed for this purpose and met on Monday, April 2, 2001 (see Appendix 1 for list of attendees) 
to hear presentations relevant to analysis of these failures and to assess the failure investigation and 
recommend disposition of parts on Odyssey boards. In particular, the objectives of the review and
analysis were to review the available data, determine if all possible and meaningful actions had been
taken, provide expert opinion on the validity of the analysis results, suggest possible failure
mechanisms for the FPGA failures, and assess the implications of each possible failure mechanism for
the Odyssey flight FPGAs. The Independent Assessment Team consisted of
Donald Mayer (Chair) and Jon Osborn, both of The Aerospace Corporation; Rich Katz of NASA
Goddard Space Flight Center; and Jerry Soden of Sandia National Laboratories.

An RP1280A field-programmable gate array (FPGA) failed on a SIRTF printed circuit board.
This device was from a parts lot designated U1H466 common to the SIRTF, Odyssey, and Genesis
missions. The RP1280A consists of a commercial 1.0 µm Matsushita A1280A die packaged by Space
Electronics Inc. (SEI) using their package shielding technology and electrically tested, screened, and
procured through Actel. This packaging may provide higher levels of total dose radiation tolerance than the
A1280A or RT1280A, which use the same dice with normal Actel packaging. An alternative
pin-compatible device is the RH1280, which is produced on Lockheed Martin's
radiation-hardened/high-reliability processing line in Manassas, Virginia. The RP1280A was chosen
by Lockheed Martin (Denver) for cost and heritage reasons, and the devices delivered by Actel for the
SIRTF, Odyssey, and Genesis programs were the first lot of devices ever delivered by Actel using the
SEI packaging technology.

Prior to the meeting, information made available by Lockheed Martin, JPL, and Actel was 
carefully reviewed. Lockheed Martin had concluded initially on March 16 that the most probable 
cause of failure was a particle contaminant but during the IAT meeting stated that the cause of failure 
was unknown; JPL considered the FPGA failure a random occurrence. Additionally, the material
suggested that a second FPGA had been removed from an Odyssey board due to
anomalously high current; this was confirmed, and the Problem/Failure Report was
provided.


IAT Review Process 

The review consisted of presentations by Lockheed Martin and JPL personnel, questions by the 
IAT participants and other attendees, a part mechanical test, and a lab visit. Schematic drawings of the
board and FPGA designs were not provided to the IAT for evaluation prior to or during the meeting
despite requests, but were provided the day prior to launch. One outcome of the review was that the
particle initially reported as the probable cause of the SIRTF failure was most probably an artifact of
Lockheed Martin’s processing. JPL also proposed an ESD model for the cause of failure, but was
unable to show a physical correlation with the failed flight unit. In summary, the data shows that there
are two failed flight RP1280A FPGAs at Lockheed Martin and there is no known cause that can be 
established for either failure. 

The failure analysis methodology and its results were discussed in depth. Various analysis 
techniques were used which are common to many device types and failure analysis laboratories. 
However, the features inherent in Actel FPGA technology for internal characterization were not 
exploited. The device architecture and support hardware and software are capable of a variety of
diagnostic and electronic probing tests. These capabilities, when combined with traditional failure 
analysis techniques, often provide critical information for the analysis of failed devices. Both failed 
devices were destructively processed in the failure analysis laboratory before information based on 
Actel-specific non-destructive diagnostics and analyses could be obtained. 


Possible Failure Mechanisms 

The failure mechanisms for the Odyssey and SIRTF failures are not known, but some 
speculations may be made from the data presented. 

The FPGA failures involved the parts drawing excessive current. At the time the parts were 
removed, the SIRTF part had failed functionally while the Odyssey part had not. A hot spot was 
identified in the Odyssey failure. This site (in the core of the device) could have been the source of the 
high current (22 mA) observed as the failure signature for this part. However, no reproducible 
individual hot spots were identified in the SIRTF failure that might source the high currents observed, 
and no single point failure site is known that might allow the very high current levels (more than 500 
mA) observed. Therefore, it is likely that the observed leakage current in the SIRTF part is global in 
nature or distributed over a large number of discrete points. Theoretically, possible failure modes
include failure of one of the charge pumps, which provide the gate voltage to the pass gates that
connect logic module inputs and outputs to the programmed interconnects; a conductive short that
pulls a net to an intermediate voltage instead of a rail; an open or high-resistance conduction path that
allows some critical net to float to an intermediate potential; or some other unknown problem.



The lack of sufficient information about the failures precludes identification of the failure
mechanism. However, there are a number of potential failure mechanisms that might have created the
observed failure mode. Speculation on a few possibilities follows.

ESD/EOS during handling at Lockheed-Martin 

If the parts were ESD damaged during handling, for example in the programming, assembly, 
stocking, kitting, or test areas, it would likely create a defect that would result in an immediately 
observable failure or change in part performance. It is also possible that such an event could create 
latent damage that would worsen with time, such as a metal open or short, or a gate oxide defect.

Likelihood that this proposed mechanism is root cause of Odyssey/SIRTF part failures - 
Unknown but probably low. The parts might have been damaged during handling at Lockheed-Martin,
in spite of their ESD control procedures. An ESD or EOS event can create damage in an oxide of an
integrated circuit that would result in a hot spot during subsequent failure analysis. However, Actel
has not observed ESD damage at the location (in the core) identified by Lockheed-Martin, and no damage
was seen in the I/Os, where damage is normally observed. Lockheed-Martin has not observed ESD
damage in prior parts handled in their facility in 15 years. 

Corrective action, if proven as failure mechanism - Improve ESD protection and awareness at 
Lockheed-Martin (e.g., conductive floor mats, verify earth grounds, ground test fixtures, add ionizers if
necessary, etc.).

Recommendation, if proven as failure mechanism - Requalify or replace all FPGAs potentially 
exposed to ESD at Lockheed-Martin. Demonstrate reliability of remaining parts by reasonable amount 
of additional testing, screening, and analysis. 

Manufacturing defect (systemic) 

An error during manufacturing may have affected all devices on a wafer or in a wafer lot. The 
wafer ID was not available because the Action Probe was not used in the diagnostics. Such an error might
include overetched poly gates, incompletely etched contacts or vias, or inadequate metal step coverage
in contacts or vias. If most defects were found, but some escaped, and worsened with time due to 
time-dependent dielectric breakdown (TDDB) or electromigration, for example, it might generate the 
observed symptoms. 

Likelihood that this proposed mechanism is root cause of Odyssey/SIRTF part failures - 
Unknown but probably low. A systemic processing error would likely have resulted in out-of-family
behavior for the manufacturing lot, which does not seem to be the case here, considering history of
parts from the same wafer lot that were not handled at Lockheed-Martin. However, if manufacturing 
process was nominal but with variations that allowed significant population with marginal defects, 
some escapes would be possible, which may explain SIRTF failure. However, based on statistics 
provided by Actel, that is not characteristic of Actel parts built at Matsushita in the 1.0-µm process.
Gradual degradation of contact hole in global net, for example, might result in floating node over time. 
However, many or most of the contacts are redundant in these parts. Unstable current behavior 
observed at Lockheed-Martin during slow power up of failed part is consistent with floating node in 
global net, conceivably caused by gradual opening of minimally conducting contact during operation, 
or this can be an artifact of the test that is not applicable to the Actel antifuse technology; note that the 
charge pump must start for this device to be properly operational. 

Corrective action, if proven as failure mechanism - Probably none if shown to be failure 
mechanism in only one part, due to high reliability of Actel product. 

Recommendation, if proven as failure mechanism - If shown to be failure mechanism in only
one part, it would allow use of remaining parts as is. If shown to be the failure mechanism of both 
parts, do not use parts from this wafer lot. 


Board design and/or manufacturing error 

It is possible that an error in board design may apply input logic high levels before complete 
power on, creating electrical latchup. Other possibilities include floating Mode, Vpp, Vsv, or Vks pins
on device; the Mode pin termination was not physically checked on the board. Also, all unused output
pins of Act2 are driven low by automated synthesis tools; they should be floated or tied to resistive
loads. Failure to do this might cause EOS problems that could worsen with time or with each
repetitive operation. 

Likelihood that this proposed mechanism is root cause of Odyssey/SIRTF part failures - 
Unknown but probably low. Grounded inputs may stress buffers during power up. Such a board 
design error would likely involve high-current stress failure, such as open metal line. Thorough
inspection of board schematics would help bound risk for this failure mechanism. 

Corrective action, if proven as failure mechanism - Assurance was given that the power up 
sequence is well controlled at Lockheed-Martin. This should be verified. Board should be checked for
proper application of voltage signals during power-up and during operation. Board schematics should 
be checked for design problems. 

Recommendation, if proven as failure mechanism - Correct board design as needed. Replace
FPGAs on suspect boards. If shown to be failure mechanism in one part, it would allow use of parts on
other boards with different designs as is; for other boards with the same design, replace parts as required.

FPGA circuit design error 

It is conceivable that some circuit design problem could cause a latent error, perhaps by 
overstressing some area of the circuit during power up, for example. It is conceivable that some circuit 
design error, or error introduced by the automated synthesis tools is creating a fault that places strain 
on the part leading to degradation and failure. An example might be a timing error resulting in a bus 
conflict, such as a bi-directional pin being driven as an output by the FPGA while a bus driver is also 
driving the bus. High currents flowing in the outputs during those bus conflicts can cause accelerated 
damage. 

Likelihood that this proposed mechanism is root cause of Odyssey/SIRTF part failures -
Unknown but probably low. Design errors or uncertainties have been observed in the Odyssey 
schematics, such as grounded inputs, logic hazards at flip-flop clocks, parallel counter with high-skew 
clocks, and possible setup time, hold time, and synchronization errors. There is no direct link from 
these logic errors to the failure signatures of either part. This failure mechanism is likely to cause 
damage due to local high current stress in metal line, and would not result in hot spot observed for 
Odyssey. 

Corrective action, if proven as failure mechanism - Measure in laboratory actual data bus
timing and interrupt vector timing, and verify that simulation testbench accurately models application.
Re-run simulations looking for I/O driver conflicts.
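
As an illustration of what a search for I/O driver conflicts might look like, the following is a minimal sketch in Python. It assumes a hypothetical trace format of (time in ns, FPGA output enabled, external bus driver enabled) samples exported from a simulation or a logic analyzer capture; the function name and trace layout are assumptions made for illustration and do not come from the Odyssey design or from any Actel tool.

```python
from typing import Iterable, List, Tuple

# Hypothetical trace format: (time_ns, fpga_drives_bus, external_driver_enabled).
# Each sample records a state that holds until the next sample.
Sample = Tuple[float, bool, bool]


def find_bus_conflicts(trace: Iterable[Sample]) -> List[Tuple[float, float]]:
    """Return (start, end) intervals during which both drivers are enabled."""
    conflicts: List[Tuple[float, float]] = []
    start = None       # time at which the current contention began, if any
    last_time = None   # time of the most recent sample
    for time_ns, fpga_oe, ext_oe in trace:
        both = fpga_oe and ext_oe
        if both and start is None:
            start = time_ns                      # contention begins
        elif not both and start is not None:
            conflicts.append((start, time_ns))   # contention ends
            start = None
        last_time = time_ns
    if start is not None:
        # Trace ended while both drivers were still enabled.
        conflicts.append((start, last_time))
    return conflicts


# Illustrative data only: the FPGA turns on 5 ns before the bus driver turns off.
trace = [
    (0.0, False, True),
    (10.0, True, True),
    (15.0, True, False),
    (40.0, False, False),
    (50.0, False, True),
]
print(find_bus_conflicts(trace))  # [(10.0, 15.0)]
```

Any interval reported by such a check would indicate a window in which high output currents could flow; whether that window is long enough to cause damage would still have to be judged against the device's output drive characteristics.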

Recommendation, if proven as failure mechanism - Modify design as needed, and replace all 
affected devices. If shown to be failure mechanism in one part, it would allow use of parts with 
different design as is.

Manufacturing defect ('random') 

A random defect such as a conductive or non-conductive particle may have created a short or 
open in the circuit, which could have generated this problem. The random defect might initially have 
had no impact because it may have been only partially closed or open, but may have eventually shorted 
via TDDB, or allowed electromigration-induced open due to high current density through reduced 
area. To date, Actel testing and analysis have shown that electromigration-induced opens have not
been a problem. It is noted that the Actel metal step coverage in these parts is not out of family, but
does not meet military standards.

Likelihood that this proposed mechanism is root cause of Odyssey/SIRTF part failures -
Unknown but probably low. This failure mechanism is possible even in a well-controlled
manufacturing process, because the particle density in even the best fab is not zero. However, it is not 
likely to occur twice in one lot. Hot spot identified in failed Odyssey part may be result of defect 
between metal traces; however defect would have to be latent, since part survived burn-in and passed 
early electrical tests. This reduces likelihood of random defect as cause of Odyssey failure. Random 
defect might create observed behavior in SIRTF failure if it resulted, for example, in gradual
degradation of charge pump, or in gradual opening of contact to other global net; note that many of the
contacts are redundant and these parts are not prone to that failure mode. Not credible as related
failure mechanism for two failures. 
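
The argument that a random defect is unlikely to occur twice in one lot can be illustrated with a simple Poisson estimate. The sketch below is an assumption-laden illustration, not part of the IAT analysis: it treats failures as independent events at a constant rate, uses the 218-part lot size and the 10 FIT upper bound quoted in the Key Findings as if they applied directly to random defects, and uses a purely hypothetical figure of 1000 operating hours per part, since the report does not state the accumulated hours.

```python
import math


def prob_at_least_two(n_parts: int, fit: float, hours_per_part: float) -> float:
    """P(2 or more failures) for independent failures at a constant rate.

    fit is the failure rate in FITs (failures per 1e9 device-hours).
    """
    expected = n_parts * fit * 1e-9 * hours_per_part  # Poisson mean
    # 1 - P(0) - P(1), with expm1 used for accuracy when the mean is small.
    return -math.expm1(-expected) - expected * math.exp(-expected)


# 218 parts and 10 FIT are taken from the Key Findings; 1000 h per part is
# a hypothetical placeholder, not a value from the report.
p = prob_at_least_two(n_parts=218, fit=10.0, hours_per_part=1000.0)
print(f"P(two or more random failures) = {p:.2e}")
```

Under these assumptions the probability is of the order of a few parts per million. The exact value scales roughly with the square of the assumed hours, so the figure should be read only as an indication of why two independent random defects in one lot are not considered credible.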

Corrective action, if proven as failure mechanism - Probably none. This is not a systemic 
failure mechanism. 

Recommendation, if proven as failure mechanism - If shown to be failure mechanism in one 
part, it would allow use of remaining parts as is. If shown to be the failure mechanism of both parts, 
do not use parts from this wafer lot. 

Programmer error at Lockheed-Martin 

If the programmer did not operate properly during programming operations, the part might 
have been damaged. 

Likelihood that this proposed mechanism is root cause of Odyssey/SIRTF part failures - 
Unknown but probably low. The Lockheed-Martin estimated programming failure rate of 10 percent is
abnormally high for A1280A-based Actel FPGAs. On April 3, 2001, an Actel field application 
engineer ran the calibration test with Lockheed-Martin personnel and no problems were found. 
Additionally no problems were found during the adapter module test. 

Corrective action, if proven as failure mechanism - Repair or replace programming unit. 

Recommendation, if proven as failure mechanism - Remove all FPGAs programmed with this 
programming device from flight units. 

ESD/EOS during handling outside Lockheed-Martin 

If the parts were ESD damaged during handling at SEI, Actel, or other location, it may have
created a latent defect that would worsen with time.

Likelihood that this proposed mechanism is root cause of Odyssey/SIRTF part failures - 
Unknown but probably low. Lifetest data on similar parts programmed at Actel show no failures. 
There are no reports of failures of other parts from this wafer lot. 

Corrective action, if proven as failure mechanism - Improve ESD protection and awareness at
facilities outside Lockheed-Martin or find other vendors for each manufacturing step. 

Recommendation, if proven as failure mechanism - Perform part screening and analysis as 
appropriate to determine acceptability of remaining parts. Requalify or replace all FPGAs from this 
lot. 


Key Findings of the IAT

1. Two parts out of 218 from the flight lot have failed on boards at Lockheed Martin despite having in 
place high-reliability space qualification parts screening and handling processes.

2. Failure analysis at Lockheed-Martin and Actel has not identified root causes of these failures. 



3. The failure rate calculated by Actel for the A1280A part type is below 10 FITs based on Actel
lifetest data, which is characteristic of a high-reliability part. There is no evidence of a
fundamental reliability problem with the Actel A1280A FPGA.

4. The failure rate for parts on boards at Lockheed-Martin is significantly higher than predicted from
Actel reliability data and experience. The failure rates at Lockheed-Martin for these RP1280A
devices and for the A1280A parts for the Titan program are two orders of magnitude higher than
the Actel data predicts.

5. Circuit design errors have been identified in the Odyssey FPGA, but the likelihood that they are the
root cause of the failures is believed to be low.

6. Several possible failure mechanisms have been suggested for each failure, but until mechanisms for
these failures are identified with high confidence, use of these parts on Lockheed-Martin boards 
entails an unknown level of risk. 
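
The contrast between findings 1, 3, and 4 can be made concrete with a short piece of arithmetic. One FIT is one failure per 10^9 device-hours, so the sketch below asks how many cumulative device-hours would be needed before two failures are expected at the Actel-derived rate. It is an illustration under stated assumptions: the 10 FIT figure is treated as a point value although the report states only "below 10 FITs", and the helper name is hypothetical.

```python
# Worked arithmetic only. 10 FIT is treated as a point value (the report says
# "below 10 FITs"); the helper name is hypothetical.

HOURS_PER_YEAR = 8766.0  # 365.25 days * 24 h


def device_hours_for_failures(failures: float, fit: float) -> float:
    """Cumulative device-hours over which `failures` are expected at `fit`."""
    # 1 FIT = 1 failure per 1e9 device-hours.
    return failures / (fit * 1e-9)


total_hours = device_hours_for_failures(failures=2, fit=10.0)
hours_per_part = total_hours / 218          # 218 parts in the flight lot
years_per_part = hours_per_part / HOURS_PER_YEAR

print(f"{total_hours:.3g} device-hours in total")
print(f"{hours_per_part:.3g} hours per part")
print(f"{years_per_part:.0f} years per part")
```

At 10 FIT, two failures would be expected only after about 2 x 10^8 device-hours, which for a 218-part lot corresponds to roughly a century of operation per part. The flight lot cannot have accumulated anything close to that, which is consistent with the finding that the failure rate observed at Lockheed-Martin is far above what the Actel lifetest data predict.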


Conclusions

Observations indicate an abnormally high failure rate for flight RP1280A devices (the first 
flight lot produced using this flow) at Lockheed Martin and the causes of these failures have not been 
determined. Standard failure analysis techniques were applied to these parts; however, additional 
diagnostic techniques unique for devices of this class were not used, and the parts were prematurely 
submitted to a destructive physical analysis, making a determination of the root cause of failure 
difficult. 


Several potential failure mechanisms have been suggested by Lockheed Martin, JPL, and the 
IAT as discussed above. Several of these mechanisms would have relatively benign consequences for 
disposition of the parts currently installed on boards in the Odyssey spacecraft if established as the root 
cause of failure. However, other potential failure mechanisms have more dire consequences. As there 
is no simple way to determine the likely failure mechanisms with reasonable confidence before 
Odyssey launch, it is not possible for the IAT to recommend a disposition for the other parts on boards 
in the Odyssey spacecraft based on sound engineering principles. 